ARMA/ARIMA/SARIMA MODEL
During the exploratory data analysis, the ACF plot and the Augmented Dickey-Fuller test showed that the market breadth data is stationary, so neither differencing nor a log transformation is needed at this stage.
1 ARIMA Models
1.1 Check for stationarity
Code
Augmented Dickey-Fuller Test
data: spx_breadth_ts
Dickey-Fuller = -6.8663, Lag order = 9, p-value = 0.01
alternative hypothesis: stationary
Again, both the ACF plot and the Augmented Dickey-Fuller test show that the market breadth data is stationary, so no further differencing is needed. Since the data is not differenced, we will fit an ARMA model, and the ACF and PACF plots suggest p = 1 and q = 0 as a starting point.
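As a quick illustration of what the ADF test formalizes, a stationary series has a sample ACF that decays toward zero, while a unit-root series stays near 1 at long lags. A minimal sketch on simulated data (a persistent AR(1) as a hypothetical stand-in for the breadth series):

```r
set.seed(42)
# Simulated stand-in for a stationary series like spx_breadth_ts:
# an AR(1) with a coefficient well inside the unit circle
x <- arima.sim(model = list(ar = 0.9), n = 500) + 600
# For a stationary series the sample ACF decays toward zero;
# a non-stationary (unit-root) series would stay near 1 at long lags
rho <- acf(x, lag.max = 20, plot = FALSE)$acf
rho[21]  # autocorrelation at lag 20, well below 1
```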
1.2 Model Fitting (ARMA)
Code
# Initialize the ARIMA results list
ARMA_res <- list()
# Set a counter
cc <- 1
# Loop over AR (p) values
for (p in 1:3) {
# Loop over MA (q) values
for (q in 0:1) {
# Fit the ARIMA model
ARMA_res[[cc]] <- arima(x = spx_breadth_ts, order = c(p, 0, q))
# Increment the counter
cc <- cc + 1
}
}
# Get AIC and BIC values for model evaluation
# (arima objects have no $bic element, so use the generic BIC())
ARMA_AIC <- sapply(ARMA_res, AIC)
ARMA_BIC <- sapply(ARMA_res, BIC)
# Select the model with the minimum AIC value
best_model <- ARMA_res[[which.min(ARMA_AIC)]]
# Print the best model
print(best_model)
Call:
arima(x = spx_breadth_ts, order = c(p, 0, q))
Coefficients:
ar1 ar2 ar3 ma1 intercept
1.8353 -0.7963 -0.0497 -0.9447 600.5391
s.e. 0.0438 0.0762 0.0374 0.0250 20.9535
sigma^2 estimated as 12081: log likelihood = -4626.66, aic = 9265.32
Note that in R's arima() output the "intercept" is the process mean, not a regression constant, so the fitted model in mean-centered form is:
Breadth(t) - 600.5391 = 1.8353 * (Breadth(t-1) - 600.5391) - 0.7963 * (Breadth(t-2) - 600.5391) - 0.0497 * (Breadth(t-3) - 600.5391) - 0.9447 * ε(t-1) + ε(t)
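Because arima()'s "intercept" is the process mean, the one-step-ahead forecast can be reproduced from the mean-centered recursion. A minimal check on simulated data (a smaller ARMA(1,0,1) for speed; the same logic applies to the ARMA(3,0,1) above):

```r
set.seed(2)
# Hypothetical ARMA(1,1) series with mean 100 (stand-in data)
x   <- arima.sim(model = list(ar = 0.7, ma = 0.3), n = 400) + 100
fit <- arima(x, order = c(1, 0, 1))
phi   <- coef(fit)["ar1"]
theta <- coef(fit)["ma1"]
mu    <- coef(fit)["intercept"]   # the process mean, not a constant term
e_T   <- tail(residuals(fit), 1)  # last one-step innovation
# Mean-centered recursion for the one-step-ahead forecast
manual   <- mu + phi * (tail(x, 1) - mu) + theta * e_T
official <- predict(fit, n.ahead = 1)$pred
as.numeric(manual) - as.numeric(official)  # essentially zero
```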
1.3 Model Diagnosis
Code
Ljung-Box test
data: Residuals from ARIMA(3,0,1) with non-zero mean
Q* = 129.38, df = 147, p-value = 0.8491
Model df: 4. Total lags used: 151
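The same check can be reproduced with base R's Box.test(). A sketch on simulated data; fitdf should equal the number of estimated ARMA coefficients, which is what the "Model df" line above reports:

```r
set.seed(1)
# Fit the true model to simulated AR(1) data; the residuals
# should then be indistinguishable from white noise
x   <- arima.sim(model = list(ar = 0.5), n = 400)
fit <- arima(x, order = c(1, 0, 0))
lb  <- Box.test(residuals(fit), lag = 20, type = "Ljung-Box", fitdf = 1)
lb$p.value  # a large p-value: no evidence of leftover autocorrelation
```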
Code
# Additional residual plots
par(mfrow = c(2, 2))
# Histogram of residuals
hist(best_model$residuals, breaks = 20, main = "Histogram of Residuals", xlab = "Residuals", col = "lightblue", border = "black")
# Kernel density plot of residuals
plot(density(best_model$residuals), main = "Kernel Density Plot of Residuals", xlab = "Residuals", ylab = "Density", col = "blue")
rug(best_model$residuals, col = "black")
# Q-Q plot of residuals
qqnorm(best_model$residuals, main = "Q-Q Plot of Residuals", col = "blue")
qqline(best_model$residuals, col = "red", lwd = 2)
# Scatter plot of residuals vs. fitted values
plot(fitted(best_model), best_model$residuals, main = "Residuals vs. Fitted Values", xlab = "Fitted Values", ylab = "Residuals", col = "blue", pch = 20)
abline(h = 0, col = "red", lwd = 2)
Checking the residuals of the ARIMA(3,0,1) model, we can see that it captures the dynamics of the series well: the residuals are approximately normally distributed and most points fall close to the Q-Q line.
1.4 Compare with auto.arima()
Series: spx_breadth_ts
ARIMA(1,0,0) with non-zero mean
Coefficients:
ar1 mean
0.9101 591.6137
s.e. 0.0152 44.2257
sigma^2 = 12305: log likelihood = -4632.51
AIC=9271.02 AICc=9271.05 BIC=9284.9
The result of auto.arima() differs from the model I chose; there are several possible causes:
Different criteria: While I used the AIC value to select the best model in my manual search, auto.arima() also considers AICc (corrected AIC) and BIC (Bayesian Information Criterion) in its model selection. These criteria may lead to different model choices.
Stepwise procedure: auto.arima() uses a stepwise search algorithm that starts with a simple model and adds or removes terms based on the information criterion. This stepwise approach is computationally efficient, but it does not explore all possible combinations of p, d, and q, and thus might not find the same model as my grid search.
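The first point can be illustrated directly: AIC and BIC can rank the same candidates differently, because BIC's per-parameter penalty is log(n) rather than 2. A sketch on simulated data with arbitrarily chosen orders:

```r
set.seed(7)
x      <- arima.sim(model = list(ar = 0.9), n = 300)
orders <- list(c(1, 0, 0), c(2, 0, 0), c(1, 0, 1))
fits   <- lapply(orders, function(o) arima(x, order = o))
aic <- sapply(fits, AIC)
bic <- sapply(fits, BIC)
# BIC penalizes extra parameters more heavily than AIC for n > 7,
# so the two criteria need not pick the same model
rbind(aic, bic)
```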
1.5 ARIMA model forecast
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2022.3348 850.0667 708.7395 991.3939 633.9254 1066.208
2022.3376 844.7766 655.5244 1034.0288 555.3404 1134.213
2022.3403 837.5736 614.3001 1060.8470 496.1062 1179.041
2022.3431 828.6691 579.6969 1077.6414 447.8989 1209.439
2022.3458 818.3257 549.4260 1087.2254 407.0791 1229.572
2022.3485 806.7914 522.3208 1091.2619 371.7312 1241.852
2022.3513 794.3017 497.6918 1090.9116 340.6760 1247.927
2022.3540 781.0788 475.0914 1087.0663 313.1114 1249.046
2022.3567 767.3301 454.2068 1080.4533 288.4494 1246.211
2022.3595 753.2477 434.8075 1071.6879 266.2354 1240.260
We can see that the model provides a good forecast for the first few months, but the intervals widen quickly and later changes are not captured as well.
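This widening is intrinsic to ARMA forecasts: the point forecast reverts toward the mean while the forecast standard error grows with the horizon. A base-R sketch with predict() on simulated stand-in data:

```r
set.seed(3)
x   <- arima.sim(model = list(ar = 0.9), n = 500) + 600
fit <- arima(x, order = c(1, 0, 0))
fc  <- predict(fit, n.ahead = 10)
# 95% interval bounds, as in the Lo 95 / Hi 95 columns above
lower95 <- fc$pred - 1.96 * fc$se
upper95 <- fc$pred + 1.96 * fc$se
fc$se  # forecast standard errors widen with the horizon
```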
1.6 Compare with benchmark
ARIMA(3,0,1) model:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.2061896 109.9129 82.16254 -Inf Inf 0.2886379 -0.002412457
Naive benchmark:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -66.77762 359.1479 290.0464 -145.9626 179.2656 1.018936 0.8979772
Comparing RMSE (109.9 vs. 359.1), my manually selected model clearly outperforms the naive benchmark.
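This comparison can be reproduced by hand: the model's in-sample RMSE comes from its residuals, and the naive benchmark's from the first differences, since the naive forecast of each point is the previous observation. A sketch on simulated data:

```r
set.seed(5)
x   <- arima.sim(model = list(ar = 0.6), n = 500) + 600
fit <- arima(x, order = c(1, 0, 0))
rmse_model <- sqrt(mean(residuals(fit)^2))
rmse_naive <- sqrt(mean(diff(x)^2))  # naive: forecast = last observation
c(model = rmse_model, naive = rmse_naive)
```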
Code
autoplot(spx_breadth_ts) +
autolayer(meanf(spx_breadth_ts, h=11),
series="Mean", PI=FALSE) +
autolayer(naive(spx_breadth_ts, h=11),
series="Naïve", PI=FALSE) +
autolayer(forecast(best_model, h=11),
series="Final ARIMA model", PI=FALSE) +
ggtitle("Forecasts for SPX market breadth") +
xlab("Year") + ylab("Market Breadth") +
guides(colour=guide_legend(title="Forecast"))
Visualizing the forecasts also makes it clear that the model outperforms the benchmark methods.
2 SARIMA
2.1 Check for Seasonality
Decomposing the market breadth time series, I cannot observe a significant seasonal effect in the data. This is expected for market breadth, which measures the direction of market movement rather than its overall level. However, we can still fit a SARIMA model to the market breadth to check whether any seasonal correlation exists.
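The decomposition step can be sketched with base R's decompose(); here on a simulated monthly series with a deliberately strong annual cycle (unlike the breadth data, where the seasonal component was negligible):

```r
set.seed(11)
# 10 years of monthly data: a clear annual sinusoid plus noise
x   <- ts(10 * sin(2 * pi * (1:120) / 12) + rnorm(120), frequency = 12)
dec <- decompose(x)
# A seasonal component much larger than the random remainder
# indicates a real seasonal effect
diff(range(dec$seasonal))  # amplitude of the estimated seasonal pattern
```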
2.2 Fit an appropriate ARIMA model
From this we can see that no second-order differencing is needed, so d = 1. The ACF plot suggests q = 1 or 3, and the PACF plot suggests p = 1, 3, or 4.
Code
i <- 1
results <- matrix(NA, nrow = 12, ncol = 6)  # 3 (p) x 2 (q) x 2 (d) = 12 models
for (p in 1:3) {      # AR order
  for (q in 1:2) {    # MA order
    for (d in 1:2) {  # differencing order
      # Note: Arima() ignores the drift term when d >= 2
      model <- Arima(spx_breadth_ts, order = c(p, d, q), include.drift = TRUE)
      results[i, ] <- c(p, d, q, model$aic, model$bic, model$aicc)
      i <- i + 1
    }
  }
}
temp <- as.data.frame(results)
names(temp) <- c("p", "d", "q", "AIC", "BIC", "AICc")
knitr::kable(temp)
| p | d | q | AIC | BIC | AICc |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 9289.878 | 9308.385 | 9289.931 |
| 1 | 2 | 1 | 9284.062 | 9297.938 | 9284.094 |
| 1 | 1 | 2 | 9291.797 | 9314.930 | 9291.877 |
| 1 | 2 | 2 | 9285.345 | 9303.846 | 9285.398 |
| 2 | 1 | 1 | 9291.809 | 9314.942 | 9291.889 |
| 2 | 2 | 1 | 9285.487 | 9303.989 | 9285.541 |
| 2 | 1 | 2 | 9293.877 | 9321.637 | 9293.989 |
| 2 | 2 | 2 | 9287.197 | 9310.324 | 9287.278 |
| 3 | 1 | 1 | 9263.564 | 9291.324 | 9263.676 |
| 3 | 2 | 1 | 9286.481 | 9309.608 | 9286.562 |
| 3 | 1 | 2 | 9265.420 | 9297.807 | 9265.570 |
| 3 | 2 | 2 | 9267.954 | 9295.707 | 9268.067 |
p d q AIC BIC AICc
9 3 1 1 9263.564 9291.324 9263.676
Thus the best model has p = 3, d = 1, and q = 1.
Code
From the resulting plot, we can see that the model fits the series quite well.
2.3 Model forecast with CV
Series: spx_breadth_ts
ARIMA(3,1,1) with drift
Coefficients:
ar1 ar2 ar3 ma1 drift
0.9067 0.0720 -0.0727 -1.0000 -0.0251
s.e. 0.0363 0.0489 0.0365 0.0041 0.1901
sigma^2 = 12303: log likelihood = -4625.78
AIC=9263.56 AICc=9263.68 BIC=9291.32
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 4.550812 110.4763 81.03106 Inf Inf 0.284663 -0.00238004
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2022.3348 823.7815 681.5435 966.0194 606.2473 1041.316
2022.3376 799.7081 607.5853 991.8309 505.8817 1093.535
2022.3403 777.0914 546.5002 1007.6827 424.4325 1129.750
2022.3431 756.9116 499.3430 1014.4803 362.9943 1150.829
2022.3458 738.7348 460.9545 1016.5151 313.9065 1163.563
2022.3485 722.4437 429.3544 1015.5330 274.2022 1170.685
2022.3513 707.8291 402.9161 1012.7422 241.5048 1174.153
2022.3540 694.7249 380.5824 1008.8675 214.2853 1175.164
2022.3567 682.9736 361.5617 1004.3856 191.4165 1174.531
2022.3595 672.4359 345.2613 999.6105 172.0655 1172.806
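The "CV" in this section's title refers to time-series cross-validation. A minimal rolling-origin sketch in base R, using an expanding window and one-step-ahead errors (simulated data as a stand-in for the breadth series):

```r
set.seed(9)
x <- arima.sim(model = list(ar = 0.8), n = 300) + 600
# Refit on an expanding window and score the one-step-ahead forecast
sq_err <- sapply(200:299, function(T) {
  fit <- arima(x[1:T], order = c(1, 0, 0))
  as.numeric((x[T + 1] - predict(fit, n.ahead = 1)$pred)^2)
})
sqrt(mean(sq_err))  # cross-validated one-step RMSE
```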
2.4 Model comparison with benchmark
Ljung-Box test
data: Residuals from Mean
Q* = 3833.8, df = 151, p-value < 2.2e-16
Model df: 0. Total lags used: 151
Ljung-Box test
data: Residuals from Naive method
Q* = 142.56, df = 151, p-value = 0.676
Model df: 0. Total lags used: 151
Code
The mean benchmark's residuals show strong remaining autocorrelation (p < 2.2e-16); combined with the earlier accuracy comparison, my model clearly outperforms the benchmark models.